title.png

New York is one of the world's most populous megacities. It has been described as the cultural, financial, and media capital of the world, and it exerts significant influence on commerce, entertainment, research, technology, education, politics, tourism, dining, art, fashion, and sports. Often called the most photographed city in the world, NYC attracts people from a wide range of cultures and income groups.

A city this huge inevitably becomes an epicenter of criminal activity. Using statistical analysis, we will look for trends in crime and predict crime rates with spatial-temporal analytics.

Dataset¶

We are using the NYPD Complaint Data dataset from NYC Open Data. This dataset includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 through the end of 2020.

Toolset¶

  • Python - one of the world's most widely used languages for data analysis and machine learning
  • Pandas - a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation library built on top of Python
  • Matplotlib, plotly and seaborn - for visualization

First steps¶

Importing the libraries we require

In [ ]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import geopandas
from geopandas import GeoDataFrame
from geopandas import points_from_xy
from geopandas import read_file as gp_readfile
import geoplot
from prophet import Prophet
c:\Users\saiki\anaconda3\envs\nyc_crime\lib\site-packages\tqdm\auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Importing plotly failed. Interactive plots will not work.

We create a dataframe from the CSV using pandas; all our analysis will be based on this.

In [ ]:
df = pd.read_csv("./NYPD_Complaint_Data_Historic.csv")
C:\Users\saiki\AppData\Local\Temp\ipykernel_16500\2019213648.py:1: DtypeWarning: Columns (18,20) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("./NYPD_Complaint_Data_Historic.csv")
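The DtypeWarning above can be addressed by declaring an explicit dtype for the offending columns on import. Which raw columns are "mixed" is an assumption here; `HOUSING_PSA` is a plausible culprit, since codes like "0123" parse as int in some chunks and str in others. A minimal sketch with a tiny in-memory CSV standing in for the real file:

```python
import io
import pandas as pd

# Toy CSV standing in for NYPD_Complaint_Data_Historic.csv; the same dtype
# mapping would be passed when reading the real file. Treating HOUSING_PSA
# as the mixed-type column is an assumption for illustration.
csv = io.StringIO("CMPLNT_NUM,HOUSING_PSA\n100,0123\n101,\n")
df_demo = pd.read_csv(csv, dtype={"HOUSING_PSA": "string"})
# forcing a string dtype keeps leading zeros and avoids mixed-type chunks
```

Alternatively, passing `low_memory=False` makes pandas infer each column's type over the whole file at once, at the cost of memory.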

Our data is huge¶

A dataset this large may introduce a lot of noise and bias our analysis toward patterns that are now outdated. For example, an increase in police funding may have influenced crime patterns in recent years, so we create a subset of the data and analyze crime from 2017 onward.

In [ ]:
df['REPORTED_DATE'] =  pd.to_datetime(df['RPT_DT'], format='%m/%d/%Y', errors='coerce')
df['TIME'] = pd.to_datetime(df.CMPLNT_FR_TM, errors='coerce').dt.hour
df['YEAR'] =  df.REPORTED_DATE.dt.year
df['MONTH'] =  df.REPORTED_DATE.dt.month
df = df[df.YEAR.gt(2016)]
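As a side note, `errors='coerce'` in the cell above turns unparseable date strings into NaT instead of raising; a quick illustration:

```python
import pandas as pd

# a malformed entry becomes NaT rather than raising a ValueError
parsed = pd.to_datetime(pd.Series(["12/31/2019", "not a date"]),
                        format="%m/%d/%Y", errors="coerce")
```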

Convert to geodata frame for spatial analysis

In [ ]:
geometry = points_from_xy(df['Longitude'], df['Latitude'], crs='EPSG:4326')
df = GeoDataFrame(df, geometry=geometry)

Dataset Preprocessing¶

Data preprocessing can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance, and is an important step in the data mining process.

We start by taking a look at the shape of the dataset to understand what kind of data we will be dealing with.

In [ ]:
df.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 1808176 entries, 0 to 7375992
Data columns (total 40 columns):
 #   Column             Dtype         
---  ------             -----         
 0   CMPLNT_NUM         int64         
 1   CMPLNT_FR_DT       object        
 2   CMPLNT_FR_TM       object        
 3   CMPLNT_TO_DT       object        
 4   CMPLNT_TO_TM       object        
 5   ADDR_PCT_CD        float64       
 6   RPT_DT             object        
 7   KY_CD              int64         
 8   OFNS_DESC          object        
 9   PD_CD              float64       
 10  PD_DESC            object        
 11  CRM_ATPT_CPTD_CD   object        
 12  LAW_CAT_CD         object        
 13  BORO_NM            object        
 14  LOC_OF_OCCUR_DESC  object        
 15  PREM_TYP_DESC      object        
 16  JURIS_DESC         object        
 17  JURISDICTION_CODE  float64       
 18  PARKS_NM           object        
 19  HADEVELOPT         object        
 20  HOUSING_PSA        object        
 21  X_COORD_CD         float64       
 22  Y_COORD_CD         float64       
 23  SUSP_AGE_GROUP     object        
 24  SUSP_RACE          object        
 25  SUSP_SEX           object        
 26  TRANSIT_DISTRICT   float64       
 27  Latitude           float64       
 28  Longitude          float64       
 29  Lat_Lon            object        
 30  PATROL_BORO        object        
 31  STATION_NAME       object        
 32  VIC_AGE_GROUP      object        
 33  VIC_RACE           object        
 34  VIC_SEX            object        
 35  REPORTED_DATE      datetime64[ns]
 36  TIME               float64       
 37  YEAR               int64         
 38  MONTH              int64         
 39  geometry           geometry      
dtypes: datetime64[ns](1), float64(9), geometry(1), int64(4), object(25)
memory usage: 565.6+ MB

Shape¶

Let's look at the number of entries and columns we have

In [ ]:
df.shape
Out[ ]:
(1808176, 40)

Analyzing the missing data in the dataset¶

In [ ]:
percent_missing = round(df.isna().sum() / len(df) * 100)
pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})
Out[ ]:
column_name percent_missing
CMPLNT_NUM CMPLNT_NUM 0.0
CMPLNT_FR_DT CMPLNT_FR_DT 0.0
CMPLNT_FR_TM CMPLNT_FR_TM 0.0
CMPLNT_TO_DT CMPLNT_TO_DT 13.0
CMPLNT_TO_TM CMPLNT_TO_TM 13.0
ADDR_PCT_CD ADDR_PCT_CD 0.0
RPT_DT RPT_DT 0.0
KY_CD KY_CD 0.0
OFNS_DESC OFNS_DESC 0.0
PD_CD PD_CD 0.0
PD_DESC PD_DESC 0.0
CRM_ATPT_CPTD_CD CRM_ATPT_CPTD_CD 0.0
LAW_CAT_CD LAW_CAT_CD 0.0
BORO_NM BORO_NM 0.0
LOC_OF_OCCUR_DESC LOC_OF_OCCUR_DESC 18.0
PREM_TYP_DESC PREM_TYP_DESC 0.0
JURIS_DESC JURIS_DESC 0.0
JURISDICTION_CODE JURISDICTION_CODE 0.0
PARKS_NM PARKS_NM 99.0
HADEVELOPT HADEVELOPT 96.0
HOUSING_PSA HOUSING_PSA 93.0
X_COORD_CD X_COORD_CD 0.0
Y_COORD_CD Y_COORD_CD 0.0
SUSP_AGE_GROUP SUSP_AGE_GROUP 25.0
SUSP_RACE SUSP_RACE 25.0
SUSP_SEX SUSP_SEX 25.0
TRANSIT_DISTRICT TRANSIT_DISTRICT 98.0
Latitude Latitude 0.0
Longitude Longitude 0.0
Lat_Lon Lat_Lon 0.0
PATROL_BORO PATROL_BORO 0.0
STATION_NAME STATION_NAME 98.0
VIC_AGE_GROUP VIC_AGE_GROUP 0.0
VIC_RACE VIC_RACE 0.0
VIC_SEX VIC_SEX 0.0
REPORTED_DATE REPORTED_DATE 0.0
TIME TIME 0.0
YEAR YEAR 0.0
MONTH MONTH 0.0
geometry geometry 0.0

We drop rows with invalid/empty values in fields we require, and drop columns that are almost entirely empty, to make the data frame lighter to process, since it is so obviously huge. For important metrics that contribute to our analysis, we instead substitute missing values with UNKNOWN.

The size of the dataset is 2.19 GB.
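Rather than eyeballing the missing-data table, sparse columns can also be flagged programmatically against a missingness threshold; a small sketch with toy data (the names and the 90% cutoff are illustrative):

```python
import pandas as pd

# toy frame: PARKS_NM is 95% missing, KY_CD is complete
demo = pd.DataFrame({"PARKS_NM": [None] * 19 + ["CENTRAL PARK"],
                     "KY_CD": range(20)})
pct_missing = demo.isna().mean() * 100        # percent missing per column
sparse_cols = pct_missing[pct_missing > 90].index.tolist()
```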

In [ ]:
df.dropna(subset=['Y_COORD_CD','X_COORD_CD','Latitude','Longitude','CRM_ATPT_CPTD_CD','CMPLNT_FR_TM','Lat_Lon','CMPLNT_FR_DT','BORO_NM','OFNS_DESC','ADDR_PCT_CD'], inplace=True)
df.drop(['PARKS_NM','STATION_NAME','TRANSIT_DISTRICT','HADEVELOPT','HOUSING_PSA'],axis='columns', inplace=True)
df.drop(['JURISDICTION_CODE'], axis='columns', inplace=True)
df.drop(['PD_CD','PD_DESC','PATROL_BORO','CMPLNT_TO_DT','CMPLNT_TO_TM'], axis='columns', inplace=True)
df.fillna({'LOC_OF_OCCUR_DESC':'UNKNOWN'}, inplace=True)
df.fillna({'VIC_RACE':'UNKNOWN'}, inplace=True)
df.fillna({'VIC_AGE_GROUP':'UNKNOWN'}, inplace=True)
df.fillna({'VIC_SEX':'UNKNOWN'}, inplace=True)
df.fillna({'SUSP_RACE':'UNKNOWN'}, inplace=True)
df.fillna({'SUSP_AGE_GROUP':'UNKNOWN'}, inplace=True)
df.fillna({'SUSP_SEX':'UNKNOWN'}, inplace=True)
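The seven `fillna` calls above could equivalently be collapsed into one dict-valued call; a sketch with toy data:

```python
import pandas as pd

demo = pd.DataFrame({"VIC_SEX": ["F", None], "SUSP_RACE": [None, "WHITE"]})
fill_map = {col: "UNKNOWN" for col in ["VIC_SEX", "SUSP_RACE"]}
demo = demo.fillna(fill_map)   # one pass over all listed columns
```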

Quick check on types of crimes¶

In [ ]:
df["OFNS_DESC"].unique()
Out[ ]:
array(['DANGEROUS WEAPONS', 'FORGERY', 'HARRASSMENT 2',
       'MISCELLANEOUS PENAL LAW', 'BURGLARY', 'DANGEROUS DRUGS',
       'PETIT LARCENY', 'OFF. AGNST PUB ORD SENSBLTY &', 'GRAND LARCENY',
       'FELONY ASSAULT', 'ASSAULT 3 & RELATED OFFENSES', 'ARSON', 'RAPE',
       'SEX CRIMES', 'GRAND LARCENY OF MOTOR VEHICLE', 'ROBBERY',
       'CRIMINAL MISCHIEF & RELATED OF', 'THEFT-FRAUD',
       'VEHICLE AND TRAFFIC LAWS', 'CRIMINAL TRESPASS',
       'OFFENSES INVOLVING FRAUD', 'FRAUDS',
       'OFFENSES AGAINST PUBLIC ADMINI', 'OFFENSES AGAINST THE PERSON',
       'ADMINISTRATIVE CODE', 'INTOXICATED & IMPAIRED DRIVING',
       'ESCAPE 3', 'NYS LAWS-UNCLASSIFIED FELONY',
       'POSSESSION OF STOLEN PROPERTY', 'THEFT OF SERVICES',
       'KIDNAPPING & RELATED OFFENSES', 'OTHER OFFENSES RELATED TO THEF',
       'UNAUTHORIZED USE OF A VEHICLE', "BURGLAR'S TOOLS",
       'ENDAN WELFARE INCOMP', 'FRAUDULENT ACCOSTING',
       'AGRICULTURE & MRKTS LAW-UNCLASSIFIED',
       'OTHER STATE LAWS (NON PENAL LA', 'OFFENSES AGAINST PUBLIC SAFETY',
       'GAMBLING', 'PETIT LARCENY OF MOTOR VEHICLE',
       'ALCOHOLIC BEVERAGE CONTROL LAW', 'OFFENSES RELATED TO CHILDREN',
       'ANTICIPATORY OFFENSES', 'LOITERING/GAMBLING (CARDS, DIC',
       'FELONY SEX CRIMES', 'HOMICIDE-NEGLIGENT,UNCLASSIFIE',
       'PROSTITUTION & RELATED OFFENSES', 'JOSTLING',
       'CHILD ABANDONMENT/NON SUPPORT', 'OTHER STATE LAWS', 'KIDNAPPING',
       'NYS LAWS-UNCLASSIFIED VIOLATION', 'DISORDERLY CONDUCT',
       'DISRUPTION OF A RELIGIOUS SERV', 'OFFENSES AGAINST MARRIAGE UNCL',
       'HOMICIDE-NEGLIGENT-VEHICLE', 'INTOXICATED/IMPAIRED DRIVING',
       'KIDNAPPING AND RELATED OFFENSES',
       'UNLAWFUL POSS. WEAP. ON SCHOOL', 'OTHER TRAFFIC INFRACTION',
       'OTHER STATE LAWS (NON PENAL LAW)', 'FORTUNE TELLING', 'LOITERING',
       'NEW YORK CITY HEALTH CODE', 'ABORTION'], dtype=object)

Data Corrections¶

The offense descriptions contain some inconsistencies, so we correct them; for example, "ASSAULT 3 & RELATED OFFENSES" becomes "ASSAULT & RELATED OFFENSES".

In [ ]:
df_clean = df.replace({'HARRASSMENT 2': 'HARASSMENT', 
                'ESCAPE 3': 'ESCAPE',
                'ASSAULT 3 & RELATED OFFENSES': 'ASSAULT & RELATED OFFENSES',
                'CRIMINAL MISCHIEF & RELATED OF': 'CRIMINAL MISCHIEF',
                'OFF. AGNST PUB ORD SENSBLTY &': 'OFFENSES AGAINST PUBLIC ORDER/ADMINISTRATION',
                'OTHER STATE LAWS (NON PENAL LA': 'OTHER STATE LAWS (NON PENAL LAW)',
                'ENDAN WELFARE INCOMP': 'ENDANGERING WELFARE OF INCOMPETENT',
                'AGRICULTURE & MRKTS LAW-UNCLASSIFIED': 'AGRICULTURE & MARKETS LAW',
                'DISRUPTION OF A RELIGIOUS SERV': 'DISRUPTION OF A RELIGIOUS SERVICE',
                'LOITERING/GAMBLING (CARDS, DIC': 'GAMBLING',
                'OFFENSES AGAINST MARRIAGE UNCL': 'OFFENSES AGAINST MARRIAGE',
                'HOMICIDE-NEGLIGENT,UNCLASSIFIE': 'HOMICIDE-NEGLIGENT',
                'E': 'UNKNOWN',
                'D': 'BUSINESS/ORGANIZATION',
                'F': 'FEMALE',
                'M': 'MALE'})
df_clean['TIME'] = df_clean['TIME'].astype('int64')
df_clean['ADDR_PCT_CD'] = df_clean['ADDR_PCT_CD'].astype('int64')
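One caveat worth noting: a flat dict passed to `DataFrame.replace` is applied to every column, so short codes like 'M' or 'F' would be remapped anywhere they occur. A column-scoped variant avoids that (a sketch with toy data):

```python
import pandas as pd

demo = pd.DataFrame({"VIC_SEX": ["F", "M"],
                     "OFNS_DESC": ["HARRASSMENT 2", "FORGERY"]})
demo = demo.replace({
    "VIC_SEX": {"F": "FEMALE", "M": "MALE"},       # scoped to one column
    "OFNS_DESC": {"HARRASSMENT 2": "HARASSMENT"},  # offense spelling fix
})
```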
In [ ]:
df_clean.head()
Out[ ]:
CMPLNT_NUM CMPLNT_FR_DT CMPLNT_FR_TM ADDR_PCT_CD RPT_DT KY_CD OFNS_DESC CRM_ATPT_CPTD_CD LAW_CAT_CD BORO_NM ... Longitude Lat_Lon VIC_AGE_GROUP VIC_RACE VIC_SEX REPORTED_DATE TIME YEAR MONTH geometry
0 394506329 12/31/2019 17:30:00 32 12/31/2019 118 DANGEROUS WEAPONS COMPLETED FELONY MANHATTAN ... -73.943324 (40.82092679700002, -73.94332421899996) UNKNOWN UNKNOWN UNKNOWN 2019-12-31 17 2019 12 POINT (-73.94332 40.82093)
1 968873685 12/29/2019 16:31:00 47 12/29/2019 113 FORGERY COMPLETED FELONY BRONX ... -73.861640 (40.885701406000074, -73.86164032499995) UNKNOWN UNKNOWN UNKNOWN 2019-12-29 16 2019 12 POINT (-73.86164 40.88570)
2 509837549 12/15/2019 18:45:00 109 12/29/2019 578 HARASSMENT COMPLETED VIOLATION QUEENS ... -73.819824 (40.74228115600005, -73.81982408) 25-44 WHITE HISPANIC FEMALE 2019-12-29 18 2019 12 POINT (-73.81982 40.74228)
3 352454313 12/28/2019 01:00:00 47 12/28/2019 126 MISCELLANEOUS PENAL LAW COMPLETED FELONY BRONX ... -73.847545 (40.87531145100007, -73.84754521099995) UNKNOWN UNKNOWN UNKNOWN 2019-12-28 1 2019 12 POINT (-73.84755 40.87531)
5 293718737 12/27/2019 22:00:00 9 12/27/2019 107 BURGLARY ATTEMPTED FELONY MANHATTAN ... -73.980466 (40.72075882100006, -73.98046642299995) UNKNOWN UNKNOWN MALE 2019-12-27 22 2019 12 POINT (-73.98047 40.72076)

5 rows × 29 columns

TOP 10 crimes of NYC¶

In [ ]:
df_clean.OFNS_DESC.value_counts().iloc[:10].sort_values().plot(kind="barh", title = "Types of Crimes", figsize=(20,10), color = sns.color_palette("gist_rainbow_r"))
Out[ ]:
<AxesSubplot:title={'center':'Types of Crimes'}>

Based on the data above, the most commonly committed crime is PETIT LARCENY

What is PETIT LARCENY¶

The most common form of Petit Larceny is shoplifting. This charge will apply when a person takes items from a store (unless the items are worth more than $1,000). A person can be charged with this crime even if they don’t leave the store with the items. For example, if someone places an item into their pocket or bag, they might be charged on the ground that they were concealing the item and intending to steal it.

Petit Larceny does not only apply to shoplifting. People have been charged with this crime in New York State for using a doctored MetroCard at subway turnstiles, for removing a landlord’s surveillance cameras from a rental property, and for taking mail from a mailbox. Petit Larceny can also be charged when someone possesses another person’s property and refuses to return it—for example, if an acquaintance loans you a cell phone and you walk off with it.

Levels of crime¶

Felonies are the most serious kinds of crimes. Generally, a crime is considered a felony when it is punishable by more than a year in a state prison (also called a penitentiary). Examples of felonies are murder, rape, burglary, and the sale of illegal drugs.

Misdemeanors are less serious crimes, and are typically punishable by up to a year in county jail. Common misdemeanors include shoplifting, drunk driving, assault, and possession of an unregistered firearm. Often, an offense that is a misdemeanor the first time a person commits it becomes a felony the second time around.

Violations are even less serious offenses, like those involving traffic laws, which typically subject a person to nothing more than a monetary fine. Defendants charged with infractions usually have no right to a jury trial or a court-appointed lawyer. But repeat offenders, even when the offense is a mere infraction, may face stiffer penalties or charges. (Some states consider certain kinds of infractions, like traffic tickets, to be civil rather than criminal offenses.)

In [ ]:
df_clean.LAW_CAT_CD.value_counts().plot(kind='pie', figsize=(15,10), colors=sns.color_palette("cool"), legend=True, autopct='%1.2f%%', explode=(0, 0, 0.20), shadow=False, startangle=0, title="Levels of law")
Out[ ]:
<AxesSubplot:title={'center':'Levels of law'}, ylabel='LAW_CAT_CD'>

We gain an interesting insight: Brooklyn has the highest crime count while Staten Island has the lowest¶

Racial analysis¶

In [ ]:
data_vic_susp_race = df_clean[['VIC_RACE', 'SUSP_RACE']].apply(pd.Series.value_counts).reindex(index = ["BLACK", "WHITE HISPANIC", "WHITE", "BLACK HISPANIC", "ASIAN / PACIFIC ISLANDER", "AMERICAN INDIAN/ALASKAN NATIVE"])
ax = data_vic_susp_race.plot(kind="barh", color =sns.color_palette("twilight_shifted_r"), title = 'Racial analysis', figsize=(15,10))
ax.legend(["Victim Race", "Suspect Race"])
Out[ ]:
<matplotlib.legend.Legend at 0x196b6384d00>

Based on the data, we notice that Black people account for the highest counts of both victims and suspects

Age analysis¶

In [ ]:
data_vic_susp_age = df_clean[['VIC_AGE_GROUP', 'SUSP_AGE_GROUP']].apply(pd.Series.value_counts).reindex(index = ["<18", "18-24", "25-44", "45-64", "65+"])
ax = data_vic_susp_age.plot(kind="barh", color = sns.color_palette("terrain"), title = 'Age analysis', figsize=(15,10))
ax.legend(["Victim Age", "Suspect age"])
Out[ ]:
<matplotlib.legend.Legend at 0x1972699b490>

We see here that most suspects and victims fall in the 25-44 age group

Gender Analysis¶

In [ ]:
data_vic_susp_sex = df_clean[['VIC_SEX', 'SUSP_SEX']].apply(pd.Series.value_counts).reindex(index = ["MALE", "FEMALE"])
ax = data_vic_susp_sex.plot(kind="barh", color = sns.color_palette("magma_r"), title = 'Gender analysis', figsize=(15,10))
ax.legend(["Victim Sex", "Suspect sex"])
Out[ ]:
<matplotlib.legend.Legend at 0x196bfa560a0>

With the gender analysis in place, we notice a huge number of female victims. Female victims typically face sex crimes, so let's filter by sex crimes to gain more insight.

In [ ]:
sex_crimes_filtered = df_clean[df_clean.OFNS_DESC.str.contains('SEX CRIMES|RAPE', na=False)]
ax = sex_crimes_filtered[['VIC_SEX', 'SUSP_SEX']].apply(pd.Series.value_counts).reindex(index = ["MALE", "FEMALE"]).plot(kind="barh", color = sns.color_palette("rocket_r"), title = 'Sex crime analysis', figsize=(15,10))
ax.legend(["Victim Sex", "Suspect sex"])
Out[ ]:
<matplotlib.legend.Legend at 0x196f37c8ee0>

This graph confirms our intuition about sexual crimes: female victims face the most sex crimes, and most suspects are male

In [ ]:
sex_crimes_filtered_female = df_clean[df_clean.OFNS_DESC.str.contains('SEX CRIMES|RAPE', na=False) & df_clean.VIC_SEX.eq('FEMALE')]
sex_crimes_filtered_vic_age = sex_crimes_filtered_female[['VIC_AGE_GROUP']].apply(pd.Series.value_counts).reindex(index = ["<18", "18-24", "25-44", "45-64", "65+"])
ax = sex_crimes_filtered_vic_age.plot(kind="bar", color = sns.color_palette("cool_r"), title = 'Sex crimes victims by age analysis', figsize=(15,10), legend=False)

We see here that the majority of female victims are minors, followed by middle-aged women¶

In [ ]:
sex_crimes_filtered_female.PREM_TYP_DESC.value_counts().iloc[:10].sort_values().plot(kind="barh", title = "Analyze top 10 places of female sex crimes", figsize=(20,10), color = sns.color_palette("cool"))
Out[ ]:
<AxesSubplot:title={'center':'Analyze top 10 places of female sex crimes'}>

We see that the majority of sex crimes happen in or around the place of residence

Let's filter the data to look at minors¶

In [ ]:
sex_crimes_filtered_female_minors = df_clean[df_clean.OFNS_DESC.str.contains('SEX CRIMES|RAPE', na=False) & df_clean.VIC_SEX.eq('FEMALE') & df_clean.VIC_AGE_GROUP.eq('<18')]
sex_crimes_filtered_female_minors.PREM_TYP_DESC.value_counts().iloc[:10].sort_values().plot(kind="barh", title = "Analyze top 10 places of female sex crimes of minors", figsize=(20,10), color = sns.color_palette("cool_r"))
Out[ ]:
<AxesSubplot:title={'center':'Analyze top 10 places of female sex crimes of minors'}>

This follows a similar pattern to that of the female group at large

In [ ]:
ax = sex_crimes_filtered[['VIC_RACE', 'SUSP_RACE']].apply(pd.Series.value_counts).reindex(index = ["BLACK", "WHITE HISPANIC", "WHITE", "BLACK HISPANIC", "ASIAN / PACIFIC ISLANDER", "AMERICAN INDIAN/ALASKAN NATIVE"]).plot(kind="barh", color = sns.color_palette("magma"), title = 'Racially classified Sex crime analysis', figsize=(15,10))
ax.legend(["Victim Sex", "Suspect sex"])
Out[ ]:
<matplotlib.legend.Legend at 0x19695e92fd0>

We notice here that black women face the most sex crimes, and most perpetrators are black¶

In [ ]:
ax = sex_crimes_filtered_female_minors[['VIC_RACE', 'SUSP_RACE']].apply(pd.Series.value_counts).reindex(index = ["BLACK", "WHITE HISPANIC", "WHITE", "BLACK HISPANIC", "ASIAN / PACIFIC ISLANDER", "AMERICAN INDIAN/ALASKAN NATIVE"]).plot(kind="barh", color = sns.color_palette("magma_r"), title = 'Racially classified Sex crime analysis (minors)', figsize=(15,10))
ax.legend(["Victim Sex", "Suspect sex"])
Out[ ]:
<matplotlib.legend.Legend at 0x196bf8c9e20>

We notice here that white minors are the most frequent victims of sexual crimes, closely followed by black minors. Perpetrators appear to be both white and black.¶

But¶

Is our intuition right that women are primarily victims of sex crimes?

In [ ]:
df_clean[df_clean.VIC_SEX.eq('FEMALE')].OFNS_DESC.value_counts().iloc[:20].sort_values().plot(kind="barh", title = "Types of Crimes", figsize=(20,10), color = sns.color_palette("mako"))
Out[ ]:
<AxesSubplot:title={'center':'Types of Crimes'}>

We see that sex crimes are actually low in count compared to other forms of crime against women.¶

Temporal Analysis¶

By year¶

We notice that the overall crime rate has been declining year over year. The drop in 2020 may be due to the onset of COVID-19, when more people stayed home and self-isolated.

In [ ]:
df_clean.groupby('YEAR').size().plot(kind="line", title = "Total Crime Events by Year", figsize=(15,10), color = "coral")
Out[ ]:
<AxesSubplot:title={'center':'Total Crime Events by Year'}, xlabel='YEAR'>

By Month¶

In [ ]:
df_clean.groupby('MONTH').size().plot(kind = 'bar', title ='Total Crime Events by Month', figsize=(15,10),  color = sns.color_palette("cool_r") ,rot=0)
Out[ ]:
<AxesSubplot:title={'center':'Total Crime Events by Month'}, xlabel='MONTH'>

We notice the crime rate is somewhat higher around the summer months.

Homes may be more tempting to criminals because windows and doors are left open more frequently, and homeowners often spend less time at home when the weather is pleasant. Many criminals are opportunists, and opportunities present themselves more frequently during the summer months. According to studies, the reason may actually be quite simple. As temperatures rise, many people are generally uncomfortable. This discomfort can give rise to aggression which could lead to aggressive criminal activity.

By hours of day¶

In [ ]:
df_clean.groupby('TIME').size().plot(kind = 'bar', title ='Total Crime Events by Hour', figsize=(15,10), color = sns.color_palette("viridis"), xlabel = 'hours',rot=0)
Out[ ]:
<AxesSubplot:title={'center':'Total Crime Events by Hour'}, xlabel='hours'>

We can see that crime is higher around the evening and lunchtime, although we typically assume crime is higher at night. This trend could be because these are commute hours, when the most people are out and about.

Sex crimes by the hour of the day¶

In [ ]:
sex_crimes_filtered.groupby('TIME').size().plot(kind = 'bar', title ='Total Sex Crime Events by Hour', figsize=(15,10), color = sns.color_palette("spring_r"), xlabel = 'hours',rot=0)
Out[ ]:
<AxesSubplot:title={'center':'Total Sex Crime Events by Hour'}, xlabel='hours'>

From the data we can see that the early morning hours are the safest, while midnight is when most sex crimes happen.

Spatial analysis¶

In [ ]:
df_clean['BORO_NM'].value_counts().sort_values().plot(kind="barh", color = sns.color_palette("seismic_r"), title = 'Crime by Borough', figsize=(15,10))
Out[ ]:
<AxesSubplot:title={'center':'Crime by Borough'}>

This figure gives us a quick picture of how safe the different boroughs are, but policing in NYC is organized by precincts rather than boroughs. Digging deeper into precinct-level data could help us predict crime rates better.

By Precinct¶

In [ ]:
# import precincts data file to overlay on maps
pcints = gp_readfile("./Police Precincts.geojson").to_crs(epsg=4326)
pcints = pcints.rename(columns = {"precinct": "ADDR_PCT_CD"})
pcints.ADDR_PCT_CD = pcints.ADDR_PCT_CD.astype('int64')
In [ ]:
# plot the crime data joined with the precincts
sx3 = df_clean.groupby("ADDR_PCT_CD").ADDR_PCT_CD.count().to_frame()
c = pcints.join(sx3, on="ADDR_PCT_CD", how="left", lsuffix="ptable_")
c = c.dropna()
c.explore(
     column="ADDR_PCT_CD", # choropleth colored by the joined crime counts
     tooltip="ADDR_PCT_CD", # show the count in a tooltip (on hover)
     popup=True, # show all values in popup (on click)
     tiles="CartoDB positron", # use "CartoDB positron" tiles
     cmap="Reds", # use the "Reds" matplotlib colormap
     style_kwds=dict(color="black"), # use black outline
     marker_kwds=dict(radius=10, fill=True)
    )
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Spatial-Temporal Forecast¶

So far we have used the data at hand to take an analytical glance at it. The map above shows that each precinct has a different crime rate. Predicting crime for the city as a whole is not very useful, since each precinct may respond differently at each level.

Seasonality¶

We also learned from the earlier analysis that crime is seasonal. Our dataset serves as a time series, giving us a temporal dimension, and its geographic fields let us evaluate each precinct separately and forecast into the future.
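Per-precinct monthly series of this kind can be built in a single pass by grouping on both the precinct code and a monthly grouper; a sketch with toy data (column names follow the dataset):

```python
import pandas as pd

demo = pd.DataFrame({
    "ADDR_PCT_CD": [75, 75, 40],
    "REPORTED_DATE": pd.to_datetime(["2018-01-03", "2018-01-20", "2018-02-01"]),
})
# monthly is indexed by (precinct, month start); each precinct's slice can
# then be forecast separately
monthly = (demo
           .groupby(["ADDR_PCT_CD", pd.Grouper(key="REPORTED_DATE", freq="MS")])
           .size()
           .rename("y"))
```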

But just before that¶

Crimes can be divided into two types: opportunistic and planned. Planned crimes like burglary, grand theft auto, and grand larceny are observed seasonally; they are meticulously planned for a successful outcome and follow a specific trend.

Opportunistic crimes, such as robbery, happen whenever a perpetrator sees an opening to commit a crime.

Data split¶

We will use the crime data from 2017 to 2019 to predict the crimes of 2020; the 2020 data in the dataset serves as our test set.

Focus points¶

We will forecast both burglaries and robberies, since burglary is a planned crime whereas robbery is an opportunistic one. This gives us a base model with which we can predict other combinations of crimes.

Why mix opportunistic and planned crime in the same analysis? Because crime follows seasonality, as we saw in our temporal analysis. While the success rates of the two crime types may differ, both follow a pattern.

In [ ]:
# train data split: precinct 75, burglary/grand larceny/robbery, 2017-2019
df_burglary_train = (
    df_clean[df_clean.ADDR_PCT_CD.eq(75)
             & df_clean.OFNS_DESC.str.contains("BURGLARY|GRAND|ROBBERY")
             & df_clean.YEAR.lt(2020)]
    .groupby(pd.Grouper(key='REPORTED_DATE', freq='D'))
    .agg(count=('REPORTED_DATE', 'count'))
    .reset_index()
    .rename(columns={"REPORTED_DATE": "ds", "count": "y"})
    .groupby(pd.Grouper(key='ds', freq='MS'))
    .agg('count')
    .reset_index()
)
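To reuse this kind of filtering for other precincts or offense mixes, the selection can be wrapped in a small helper. The function name is ours, and unlike the cell above this simplified variant counts events per month directly; a sketch with toy data:

```python
import pandas as pd

def monthly_counts(frame, precinct, pattern, before_year):
    """Monthly offense counts for one precinct, in Prophet's ds/y layout."""
    mask = (frame["ADDR_PCT_CD"].eq(precinct)
            & frame["OFNS_DESC"].str.contains(pattern, na=False)
            & frame["YEAR"].lt(before_year))
    return (frame.loc[mask]
                 .groupby(pd.Grouper(key="REPORTED_DATE", freq="MS"))
                 .size()
                 .rename("y")
                 .reset_index()
                 .rename(columns={"REPORTED_DATE": "ds"}))

# toy data in the dataset's column layout
demo = pd.DataFrame({
    "ADDR_PCT_CD": [75, 75, 40],
    "OFNS_DESC": ["BURGLARY", "ROBBERY", "BURGLARY"],
    "YEAR": [2018, 2018, 2018],
    "REPORTED_DATE": pd.to_datetime(["2018-01-03", "2018-01-20", "2018-02-01"]),
})
train = monthly_counts(demo, precinct=75, pattern="BURGLARY|ROBBERY", before_year=2020)
```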

Forecasting¶

We will be using Facebook's Prophet library for our forecast.

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

Prophet is used in many applications across Facebook for producing reliable forecasts for planning and goal setting.

Our use case fits the tool at hand perfectly!

In [ ]:
m = Prophet(changepoint_range = 0.5, yearly_seasonality=True, changepoint_prior_scale=0.16).fit(df_burglary_train)

future = m.make_future_dataframe(periods=12, freq='MS')
fcst = m.predict(future)
INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
c:\Users\saiki\anaconda3\envs\nyc_crime\lib\site-packages\prophet\forecaster.py:896: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  components = components.append(new_comp)
INFO:prophet:n_changepoints greater than number of observations. Using 17.
c:\Users\saiki\anaconda3\envs\nyc_crime\lib\site-packages\prophet\forecaster.py:896: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  components = components.append(new_comp)
c:\Users\saiki\anaconda3\envs\nyc_crime\lib\site-packages\prophet\forecaster.py:896: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  components = components.append(new_comp)
In [ ]:
fig = m.plot(fcst, figsize=(15, 10))

TESTING¶

While we have our predictions, we still need to test them for accuracy, so we use our held-out test data to score the forecast.
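Mean absolute error is simply the average absolute gap between actual and predicted values; a tiny worked example:

```python
import numpy as np

y_true = np.array([30.0, 28.0, 31.0])   # actual monthly values
y_pred = np.array([30.5, 27.5, 31.0])   # forecast values
mae = np.mean(np.abs(y_true - y_pred))  # (0.5 + 0.5 + 0.0) / 3
```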

In [ ]:
from sklearn.metrics import mean_absolute_error
In [ ]:
# Split the test data (2020) from the main data
g1 = (
    df_clean[df_clean.ADDR_PCT_CD.eq(75)
             & df_clean.OFNS_DESC.str.contains("BURGLARY|GRAND|ROBBERY")
             & df_clean.YEAR.gt(2019)]
    .groupby(pd.Grouper(key='REPORTED_DATE', freq='D'))
    .agg(count=('REPORTED_DATE', 'count'))
    .reset_index()
    .rename(columns={"REPORTED_DATE": "ds", "count": "y"})
)
df_burglary_test = (
    g1.groupby(pd.Grouper(key='ds', freq='MS'))
      .agg('count')
      .reset_index()
)
In [ ]:
# find Mean Absolute Error and plot the graph
y_true = df_burglary_test['y'].values
y_pred = fcst['yhat'][-12:].values
mae = mean_absolute_error(y_true, y_pred)
print('MAE: %.3f' % mae)

plt.figure(figsize=(15,10))
plt.plot(y_true, label='Actual')
plt.plot(y_pred, label='Predicted')
plt.legend()
plt.set_cmap(sns.color_palette("twilight_shifted", as_cmap=True))
plt.show()
MAE: 0.215

Closing notes¶

Crime shows interesting patterns when viewed through a spatial-temporal lens. Is it a marker to rely on? The patterns seem consistent over the years, which lets us make a reasonably accurate forecast.

But¶

Crime trends are influenced by various external factors, though the patterns remain similar. Trends help us be prepared, while patterns help us understand them.

Quick fact¶

A few decades ago cars ran on leaded fuel, and at the peak of its use the lead exposure caused widespread problems in brain development. Crime rates rose through the mid-80s; then, as governments began phasing out leaded fuel, crime trends dropped. Several environmental research groups have studied this lead-crime link.

Our analysis dived into crime patterns in NYC; we observed various insights and forecasted the crime patterns.

Improvements¶

Our forecast focuses on spatial-temporal dimensions and seasonality. It could be improved by factoring in more markers such as local education levels, police funding, etc.